Work within the W3C Internationalization Activity and its Benefit for the Creation and Manipulation of Language Resources
Abstract
This paper introduces ongoing and current work within the Internationalization (i18n) Activity of the World Wide Web Consortium (W3C). The focus is on aspects of the W3C i18n Activity which benefit the creation and manipulation of multilingual language resources. In particular, the paper deals with ongoing work concerning the encoding, visualization and processing of characters; current work on language and locale identification; and current work on the internationalization of markup. The main usage scenario is the design of multilingual corpora, including issues of corpus creation and manipulation.

1. Background: Internationalization and W3C

1.1. What is Internationalization?

According to Ishida and Miller (2006), internationalization is the process of making a product or its underlying technology ready for applications in various languages, cultures and regions. Internationalization is abbreviated as “i18n” because there are 18 characters between the first character “i” and the last character “n”. Closely related to internationalization is localization, which is the process of adapting a product and technology to a locale, that is, a specific language, region or market. The concept “locale” will be described in more detail below. The abbreviation “l10n” is used because there are ten characters between the first character “l” and the last character “n”.

Sasaki (2005) and Phillips (2006) demonstrate that internationalization is not a specific feature, but a requirement for software design in general. “Software” can be a text processor, a web service, or a linguistic corpus and its processing tools. For each design target, there are different internationalization requirements.

1.2. Internationalization within the W3C

One task of the Internationalization Activity within the World Wide Web Consortium (W3C) is to review a great variety of emerging W3C technologies with respect to internationalization issues [1]. During this work, ongoing topics like character encoding, visualization and processing have to be taken into account for many technologies. This article will describe the relevance of these issues for the design target “multilingual, textual corpus”.

Besides reviewing emerging technologies, the Internationalization Activity is developing technologies itself. This article focuses on two work items which are of direct relevance for the creation and processing of language resources: a standard for the identification of languages and locales, and markup for internationalization and localization purposes.

2. Ongoing Topics of Internationalization

2.1. Creating a Corpus: Character Encoding Issues

It will be assumed that the design target is a “multilingual, textual corpus”. The corpus should contain existing and yet to be created data in various languages. In the past, multilingual corpora like the Japanese, English and German data in Verbmobil (Wahlster, 2000) have been created relying only on the ASCII character repertoire. Today the use of the Unicode (Aliprand et al., 2005) character repertoire is common practice, for corpora and many other kinds of textual data. The Basic Multilingual Plane of Unicode encompasses characters from many widely used scripts, which solves basic problems of multilingual corpus design.

However, using Unicode does not solve all problems. Various decisions still have to be made: which encoding form is suitable, how characters not in Unicode are handled, and how to deal with “glyph” variants (see below).

The encoding form is the serialization of characters in a given base data type. The Unicode standard provides three encoding forms for its character repertoire: UTF-8, UTF-16 and UTF-32 [2]. If the multilingual corpus contains only Latin-based textual data, UTF-8 will lead to a small corpus size, since this data can be represented mostly with single-byte sequences. If corpus size and bandwidth are not a concern, UTF-32 can be used. However, especially for web-based corpora, UTF-32 will slow down data access, because more bytes have to be transferred for the same text. UTF-16 is for environments which need both efficient access to characters and economical use of storage.

Unicode encodes widely used scripts and unifies regional and historic differences. Such differences are described as glyphs. Unicode unifies many glyphs into single characters. The most prominent example for the unification of ...

[1] An overview of past reviews can be found at http://www.w3.org/International/reviews/ .

[2] UTF-8 encodes characters as sequences of variable length: one, two, three or four bytes. UTF-16 uses variable sequences of one or two double-byte (16-bit) units. UTF-32 is a character serialization with a fixed length of four bytes.
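The size trade-off between the three encoding forms can be checked directly. The following is a minimal sketch, assuming Python 3. The helper name `encoded_sizes` and the sample strings are illustrative assumptions and are not taken from the paper or from any corpus it mentions.

```python
# Illustrative sample strings (assumptions, not corpus data from the paper).
samples = {
    "Latin (English)": "corpus",
    "Latin with diacritics (German)": "\u00dcbergr\u00f6\u00dfe",
    "Japanese": "\u8a00\u8a9e\u8cc7\u6e90",
    "Supplementary-plane character": "\U0002000B",
}


def encoded_sizes(text: str) -> dict:
    """Return the number of bytes needed for text in each encoding form."""
    # The "-le" codec variants are used so that no byte order mark is
    # prepended, which would otherwise inflate the counts.
    return {
        "UTF-8": len(text.encode("utf-8")),
        "UTF-16": len(text.encode("utf-16-le")),
        "UTF-32": len(text.encode("utf-32-le")),
    }


for label, text in samples.items():
    # len(text) is the number of Unicode code points in the string.
    print(label, len(text), encoded_sizes(text))
```

For the Latin samples UTF-8 yields the smallest size, while for the Japanese sample UTF-16 is smaller than UTF-8, since each of these characters needs three bytes in UTF-8 but only two in UTF-16. The supplementary-plane character takes four bytes in all three encoding forms.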